I have been watching anime for many years. It started with Accel World and Nanatsu No Taizai on Animax (which has ceased broadcasting in India). I personally think that anime has come a long way, from lousy animation to some of the best animated projects ever. So, I have analysed some trends based on the available datasets.
Anime has been very popular for a long time, especially because of its deep storylines and the bond it creates between you and the characters. For this analysis I will be using the Anime Recommendations Database dataset (which has user ratings from the famous website MyAnimeList (MAL)), publicly available on Kaggle. I thought of scraping MAL directly, but the last paragraph under “User Content” in its Terms of Use Agreement states that scraping is not allowed without prior consent.
This dataset contains information on 12,294 anime. The ratings are on a scale of 1 to 10. The top 10 entries are the following:
I will also be scraping ratings from IMDb to compare them with the MAL ratings, to see whether there is any relation between the two or whether one is preferred over the other.
For data scraping, I used the famous IMDb website to get the ratings and votes of the top anime.
I then created a data frame with names, ratings and votes, and saved it as an RData object.
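The save step can be sketched as follows. This is a minimal illustration, not the original script: the object name `imdb_dat`, the file name, and all values are assumptions, and the vectors stand in for whatever the scraper returned.

```r
# Toy values standing in for scraped results; these are NOT real IMDb figures.
anime_names <- c("Anime A", "Anime B", "Anime C")
ratings     <- c(8.4, 9.1, 7.8)
votes       <- c(120000, 95000, 30000)

# Combine the three vectors into one data frame.
imdb_dat <- data.frame(name = anime_names, rating = ratings, votes = votes,
                       stringsAsFactors = FALSE)

# Save to an .RData file (a temporary path here) so the scrape
# does not have to be repeated in later sessions.
path <- file.path(tempdir(), "imdb_data.RData")
save(imdb_dat, file = path)

# Load it back into a separate environment to check the round trip.
restored <- new.env()
load(path, envir = restored)
```

Loading into a separate environment keeps the restored object from silently overwriting one with the same name in the workspace.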
Apart from null values, there was not much data cleaning to be done for this dataset. But since the data was extracted directly from the website, the names of the anime were not in a proper format.
## [1] "Kimi no Na wa." "Fullmetal Alchemist: Brotherhood"
## [3] "Gintama°" "Steins;Gate"
## [5] "Gintama'"
Some had special characters, while others were in romaji, which made the comparison between IMDb ratings and MAL ratings very inaccurate.
To improve this, I took another dataset from Kaggle, containing anime ratings from Anime Planet, with around 18,500 anime titles and 17 columns:
## [1] "Rank" "Name" "Japanese_name" "Type"
## [5] "Episodes" "Studio" "Release_season" "Tags"
## [9] "Rating" "Release_year" "End_year" "Description"
## [13] "Content_Warning" "Related_Mange" "Related_anime" "Voice_actors"
## [17] "staff"
Of these, I kept only columns 2, 3, 6 and 10 (Name, Japanese_name, Studio and Release_year).
Then I used “which( %in% )” to identify the titles common to the names in the MAL dataset and the English names in the Anime Planet dataset, and did the same between the MAL names and the Japanese names in the Anime Planet dataset.
Using the indexes obtained above, I created a new combined dataset with the release year and both the English and Japanese titles from the Anime Planet dataset, along with the ratings and remaining columns from the MAL dataset. I called it “main_dat” and saved it as a “Final_data.RData” file.
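The matching steps above can be sketched roughly like this. The two small data frames are hypothetical stand-ins for the real datasets, the ratings and studio names are made up, and the use of `match()` to line up rows is my assumption about one reasonable way to do it, not necessarily what the original script did.

```r
# Hypothetical miniature versions of the two datasets (values are made up).
mal <- data.frame(
  name   = c("Kimi no Na wa.", "Steins;Gate", "Death Note", "Unmatched Show"),
  rating = c(9.0, 8.5, 8.0, 7.0),
  stringsAsFactors = FALSE
)
planet <- data.frame(
  Name          = c("Your Name.", "Steins;Gate", "Death Note"),
  Japanese_name = c("Kimi no Na wa.", "Steins;Gate", "Death Note"),
  Studio        = c("Studio X", "Studio Y", "Studio Z"),
  Release_year  = c(2016, 2011, 2006),
  stringsAsFactors = FALSE
)

# Indexes of MAL rows whose title appears among the English or Japanese names.
eng_idx <- which(mal$name %in% planet$Name)
jap_idx <- which(mal$name %in% planet$Japanese_name)
matched <- sort(union(eng_idx, jap_idx))

# For each matched MAL row, find the Anime Planet row: try the English
# name first and fall back to the Japanese name.
planet_row <- match(mal$name[matched], planet$Name)
fallback   <- match(mal$name[matched], planet$Japanese_name)
planet_row[is.na(planet_row)] <- fallback[is.na(planet_row)]

# Combine the MAL columns with the selected Anime Planet columns.
main_dat <- data.frame(mal[matched, ], planet[planet_row, ], row.names = NULL)

# Save under the file name used in the text (in a temporary directory here).
save(main_dat, file = file.path(tempdir(), "Final_data.RData"))
```

In this toy example “Kimi no Na wa.” only matches through the Japanese name, which is exactly the case the second lookup is there to catch.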
I followed a similar procedure to find the common titles between the IMDb data and “main_dat”.
Now that we have the final datasets, let’s try to analyze them visually. I will be using several different kinds of plots to visualize the data. The following are the different types of charts covered in the subsequent sections.
From the bar plot above we can see that “Comedy”, “Action”, “Adventure” and “Fantasy” are the top genres.
To visualize the distribution of counts of a categorical variable, we can use a pie chart, which shows the relative proportions as percentages. Let’s see the pie chart of the types of anime produced:
We can see that TV, OVA and Movie entries are predominant in the anime industry, whereas Special, ONA and Music entries are relatively few in number.
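As a rough sketch of how such a pie chart can be built in base R: the `type` vector below is toy data, not the real column, and the label format is just one possible choice.

```r
# Toy "type" column; the real dataset has one entry per anime.
type <- c("TV", "TV", "TV", "OVA", "OVA", "Movie", "Movie",
          "Special", "ONA", "Music")

# Count each category and convert the counts to percentages.
counts <- table(type)
pct    <- round(100 * prop.table(counts), 1)

# Label each slice with its category and share.
slice_labels <- paste0(names(pct), " (", pct, "%)")
pie(counts, labels = slice_labels, main = "Anime by type (toy data)")
```

Printing `pct` alongside the chart is often worthwhile, since small slices are hard to compare by eye.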
Using the Plotly library, I created an interactive plot with “imdb ratings” on the y-axis and “mal ratings” on the x-axis, and also plotted the line y = x to divide the graph into two halves: in the upper triangle the IMDb rating is greater than the MAL rating, and in the lower triangle the opposite holds.
You can drag and select an area to zoom in on it, and hovering over a point displays additional information about it.
From the above graph we see that more than half of the points are in the lower triangle. One reason could be the number of users voting: if you hover over a point you can see the number of votes each anime received, and for every anime the number of votes on MAL is far greater than on IMDb. So one can argue that the anime ratings on MAL are more refined and accurate.
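The “lower triangle” claim can also be checked numerically rather than by eye. The sketch below uses made-up ratings and vote counts and hypothetical column names; it only illustrates the comparison against the y = x line.

```r
# Made-up ratings and vote counts for five matched titles.
cmp <- data.frame(
  mal_rating  = c(8.5, 7.9, 9.0, 6.8, 8.1),
  imdb_rating = c(8.2, 8.1, 8.6, 6.5, 8.1),
  mal_votes   = c(500000, 120000, 800000, 40000, 90000),
  imdb_votes  = c(60000, 15000, 110000, 3000, 8000)
)

# Points below the y = x line have a lower IMDb rating than MAL rating.
below <- cmp$imdb_rating < cmp$mal_rating
above <- cmp$imdb_rating > cmp$mal_rating

share_below <- mean(below)   # fraction of titles in the lower triangle
share_above <- mean(above)   # fraction in the upper triangle

# Does every title have more votes on MAL than on IMDb?
mal_has_more_votes <- all(cmp$mal_votes > cmp$imdb_votes)
```

Titles with identical ratings on both sites fall on the line itself and count toward neither share.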
So we conclude that even though IMDb ratings work well for Hollywood movies and web series, they are not as good for anime.
The success of an anime also depends on which studio is animating it, i.e. on the quality of the animation. To compare studios, I gave each one a rating equal to the mean rating of all the anime it produced that are present in our dataset. Let’s take a look at a bar plot of a few of them:
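A minimal sketch of this per-studio average, assuming a data frame with one row per anime and hypothetical `studio` and `rating` columns (the values are invented):

```r
# Toy data: one row per anime, with its studio and rating.
anime <- data.frame(
  studio = c("Studio A", "Studio A", "Studio B", "Studio B", "Studio B", "Studio C"),
  rating = c(8.0, 7.0, 6.5, 7.5, 8.5, 9.0),
  stringsAsFactors = FALSE
)

# Mean rating per studio, sorted from highest to lowest.
studio_avg <- aggregate(rating ~ studio, data = anime, FUN = mean)
studio_avg <- studio_avg[order(-studio_avg$rating), ]

barplot(studio_avg$rating, names.arg = studio_avg$studio,
        ylab = "Average rating", main = "Average rating per studio (toy data)")
```

Note that “Studio C” tops this toy ranking on the strength of a single title, which is the weakness of using the average alone.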
After looking at the graph above, one might think that the higher the average rating, the better the studio. But a studio's standing does not depend on the average rating alone, so let’s look at the number of anime produced by each studio (y-axis) versus its average rating (x-axis):
There were many studios with only 1-2 anime produced, which made the plot congested, so I removed the studios with fewer than 4 anime. You can also hover over a point to see more details about it.
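The filtering step might look something like the following. The data and the column names are illustrative assumptions; only the threshold of 4 comes from the text.

```r
# Toy data: Studio A has 5 anime, Studio B has 1, Studio C has 4.
studios <- data.frame(
  studio = rep(c("Studio A", "Studio B", "Studio C"), times = c(5, 1, 4)),
  rating = c(7, 8, 7.5, 8.5, 9,   9.5,   6, 7, 6.5, 7.5),
  stringsAsFactors = FALSE
)

# Count anime per studio and keep only studios with at least 4.
n_anime <- table(studios$studio)
keep    <- names(n_anime)[as.vector(n_anime) >= 4]
kept    <- studios[studios$studio %in% keep, ]

# One row per remaining studio: number of anime and average rating.
summary_df <- data.frame(
  studio     = keep,
  n_anime    = as.integer(n_anime[keep]),
  avg_rating = as.numeric(tapply(kept$rating, kept$studio, mean)[keep]),
  stringsAsFactors = FALSE
)
```

The resulting `summary_df` has the two quantities plotted against each other in the scatter plot, with the single-title “Studio B” dropped.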
We see that studios with both a large number of anime produced and a high average rating can be considered the top studios (according to the dataset we have).
Some of the top studios are:
When we need to study the variation between two continuous variables, we resort to using scatterplots. They help us understand the correlation between these two variables i.e. if one variable increases, does the other increase, decrease or stay unaffected.
Let’s see the scatter plot of the number of votes against the ratings:
From the above plot we see what looks like an exponential increase in the number of votes around a rating of 7. This suggests that users are more likely to watch anime rated higher than 6-7, which makes those titles even more popular. This can be seen as a form of selection bias.
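One way to put a number on this relationship is a rank correlation between ratings and votes, plus a log-scaled plot so that an exponential trend shows up as a roughly straight line. The vectors below are invented for illustration, and using Spearman's correlation is my suggestion rather than something done in the original analysis.

```r
# Invented ratings and vote counts, with votes growing quickly as ratings rise.
rating <- c(5.2, 5.9, 6.4, 6.9, 7.4, 8.0, 8.6)
votes  <- c(500, 1200, 3000, 9000, 40000, 150000, 600000)

# Spearman's rank correlation only asks whether votes rise with ratings,
# without assuming the relationship is linear.
rho <- cor(rating, votes, method = "spearman")

# Pearson correlation on log10(votes) measures how close the trend is
# to exponential growth in votes.
r_log <- cor(rating, log10(votes))

# On a log-scaled y-axis, exponential growth appears as a straight line.
plot(rating, votes, log = "y",
     xlab = "Rating", ylab = "Number of votes (log scale)",
     main = "Votes vs. rating (toy data)")
```

A high correlation here would describe the pattern but would not by itself show that high ratings cause more people to watch.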
The most challenging part of carrying out the analysis is understanding the data. If the data is not well understood, the analysis becomes very difficult because there is no clear direction on what to do with it.
Data preparation and cleaning are also important, since removing the less meaningful data makes the analysis more reliable.
Overall, I enjoyed doing data analysis on something I like so much.